Below are the dimensions and the structure of the human dataset: 155 observations of 8 variables, all of them numeric or integer.
## [1] 155 8
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
Next we’ll print a summary of the data with the summary() function to get a grasp of the minimum, maximum, median, mean and quartiles of each variable.
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50
##       GNI            Mat.Mor         Ado.Birth         Parli.F
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
Here are the standard deviations of the variables.
##   sd(Edu2.FM) sd(Labo.FM) sd(Edu.Exp) sd(Life.Exp)  sd(GNI) sd(Mat.Mor)
## 1   0.2416396   0.1987786    2.840251     8.332064 18543.85    211.7896
##   sd(Ado.Birth) sd(Parli.F)
## 1      41.11205    11.48775
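The summaries and standard deviations above can be reproduced roughly as follows. This is only a minimal sketch: `toy` is a made-up stand-in for the human data (three columns, with values copied from the first rows of the structure printout), and base R’s `sapply()` is an assumption here, not necessarily the call that produced the output above.

```r
# Hypothetical stand-in for the human data: three variables, five rows
# (values taken from the first rows of the str() output above).
toy <- data.frame(
  Edu.Exp  = c(17.5, 20.2, 15.8, 18.7, 17.9),
  Life.Exp = c(81.6, 82.4, 83.0, 80.2, 81.6),
  GNI      = c(64992, 42261, 56431, 44025, 45435)
)

# Minimum, quartiles, median, mean and maximum of every column
print(summary(toy))

# Sample standard deviation of every column, as a named numeric vector
sds <- sapply(toy, sd)
print(sds)
```

Even in this tiny sample the standard deviation of GNI is several orders of magnitude larger than that of the other columns, which matters for the PCA later on.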
Let’s visualize the data to get a better overall picture of it. First we’ll produce a scatterplot matrix with base R’s pairs() and then with the GGally package’s ggpairs().
Now it’s time to produce a table of the correlations with the cor() function. The correlations are rounded to two decimals to save space.
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
Here’s a visualization of the correlation matrix with the corrplot() function. To reduce repetition, we’ll show only the upper triangle of the plot (a correlation matrix is symmetric, so its upper triangle contains the same correlations as its lower triangle).
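As a rough illustration of the rounding and of why one triangle is enough, here is a small sketch. The data frame `toy` is hypothetical (values copied from the first rows of the structure printout), and the guarded `corrplot::corrplot()` call only runs if the corrplot package happens to be installed.

```r
# Hypothetical stand-in for the human data (six rows, three variables)
toy <- data.frame(
  Edu.Exp  = c(17.5, 20.2, 15.8, 18.7, 17.9, 16.5),
  Life.Exp = c(81.6, 82.4, 83.0, 80.2, 81.6, 80.9),
  Mat.Mor  = c(4, 6, 6, 5, 6, 7)
)

# Pearson correlations, rounded to two decimals
cor_matrix <- round(cor(toy), 2)
print(cor_matrix)

# A correlation matrix equals its own transpose ...
print(isSymmetric(cor_matrix))

# ... so blanking the lower triangle loses no information
upper_only <- cor_matrix
upper_only[lower.tri(upper_only)] <- NA
print(upper_only)

# With corrplot installed, the same idea is a single argument
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, type = "upper")
}
```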
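Here are some of the correlations this exposes. Life.Exp and Mat.Mor have the strongest correlation (-0.86), followed by Edu.Exp and Life.Exp (0.79) and Mat.Mor and Ado.Birth (0.76). Mat.Mor and Ado.Birth correlate negatively with Edu.Exp, Life.Exp, Edu2.FM and GNI, whereas Labo.FM and Parli.F correlate only weakly with every other variable (at most 0.25 in absolute value).
Next we’ll perform principal component analysis (PCA) on the non-standardized human data and show the percentage of variability captured by each principal component.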
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100   0   0   0   0   0   0   0
Then we’ll draw a biplot displaying the observations by the first two principal components (PC1 on the x-axis, PC2 on the y-axis), along with arrows representing the original variables.
There’s something wrong with this PCA and its plot. The first principal component (PC1) explains 100 % of the variance (after rounding) and the remaining components (PC2-PC8) explain 0 %. The only variable name shown is GNI, whose arrow lies along the first principal component. PCA is sensitive to the relative scaling of the original features: it treats features with larger variance as more important than features with smaller variance. The human dataset’s GNI variable is measured on a radically bigger scale than the other variables and therefore has a far bigger variance (its standard deviation is about 18544, against about 212 for Mat.Mor and less than 42 for all the others; the long arrow tells the same story). This is why the PCA on the non-standardized variables failed miserably.
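The effect is easy to reproduce on made-up data. In the sketch below the data frame `toy` and its column names are hypothetical: two ratio-like variables and one income-like variable on a much larger scale. `prcomp()` centres the data by default but does not scale it.

```r
set.seed(123)
n <- 100

# Hypothetical data: two small-scale ratios and one large-scale income variable
toy <- data.frame(
  ratio_a = runif(n, 0.2, 1.5),
  ratio_b = runif(n, 0.2, 1.0),
  income  = runif(n, 500, 120000)
)

# PCA on the raw data: centred but NOT scaled (prcomp's default)
pca_raw <- prcomp(toy)

# Share of the total variance captured by each component, in percent
pct_raw <- round(100 * pca_raw$sdev^2 / sum(pca_raw$sdev^2), 1)
names(pct_raw) <- colnames(pca_raw$rotation)
print(pct_raw)

# Loadings of PC1: practically all the weight sits on the income column
print(round(pca_raw$rotation[, "PC1"], 3))
```

Because the variance of `income` dwarfs the other two, PC1 is essentially the income axis and the remaining components are left with almost nothing.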
Let’s fix this problem by standardizing the data before using it in the principal component analysis.
Here are the summaries of the scaled variables. See how the variables changed (e.g. the means are now all zero). As we can see below, the explained variance is now spread much more evenly among the PCs. The PCA plot now also makes a lot more sense.
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218
##       GNI             Mat.Mor          Ado.Birth          Parli.F
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
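Standardizing means subtracting each column’s mean and dividing by its standard deviation. The sketch below first checks this against the printed figures for Edu2.FM and then repeats the PCA on a standardized, made-up data frame (`toy` and its columns are hypothetical, not the human data).

```r
# Worked check with the printed figures for Edu2.FM:
# (minimum - mean) / sd should be close to the scaled minimum of -2.8189
z_min <- (0.1717 - 0.8529) / 0.2416396
print(round(z_min, 3))

set.seed(123)
n <- 100

# Hypothetical data on very different scales
toy <- data.frame(
  ratio_a = runif(n, 0.2, 1.5),
  ratio_b = runif(n, 0.2, 1.0),
  income  = runif(n, 500, 120000)
)

# scale() centres every column to mean 0 and rescales it to sd 1
toy_std <- scale(toy)
print(round(colMeans(toy_std), 10))
print(apply(toy_std, 2, sd))

# PCA on the standardized data
pca_std <- prcomp(toy_std)
pct_std <- round(100 * pca_std$sdev^2 / sum(pca_std$sdev^2), 1)
names(pct_std) <- colnames(pca_std$rotation)
print(pct_std)
```

Once every column has unit variance, no single variable can dominate the first component merely because of its units.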
Let’s take a closer look at the countries on the “west side” of the biplot and close to PC2.
Let’s interpret the results of both analyses and their corresponding biplots. The biplot drawn from the non-standardized data (the one with the blue arrow) was not very informative, as we learnt above. The second biplot, based on the standardized variables, on the contrary offers a lot of interesting and clearly visible information.
Interpretation of PC1
Generally speaking, the first principal component captures the maximum amount of variance from the features in the original data. Here PC1 captures 53.6 % of the variance. The variables connected to the PC1 dimension are Mat.Mor (maternal mortality) and Ado.Birth (adolescent birth rate), whose arrows point horizontally to the right, and Edu.Exp, Life.Exp, Edu2.FM and GNI, whose arrows point horizontally to the left (Mat.Mor and Ado.Birth correlate strongly and negatively with Edu.Exp, Life.Exp, Edu2.FM and GNI, as I explained above). The countries at the right end of the horizontal PC1 axis are mostly poor African countries with low values on the education-related variables; at the opposite (left) end are rich European and Asian countries (plus the USA) with high values on those variables.
Interpretation of PC2
The second principal component (PC2) is orthogonal to the first and captures the maximum amount of the remaining variance. Here that amount is 16.2 %. PC2 describes how actively women take part in the political sphere (Parli.F) and the working life (Labo.FM) of the society they live in. Many Arab states are located at the low end of the vertical PC2 axis shown in the plot.
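Interpretations like these can be cross-checked against the loadings, which `prcomp()` stores in the `rotation` element. The following is a minimal sketch on invented data, not the human dataset: `x1` and `x2` share a hypothetical common driver with opposite signs, while `x3` is independent of both.

```r
set.seed(42)
n <- 200

# Hypothetical data: x1 and x2 move in opposite directions, x3 is unrelated
driver <- rnorm(n)
toy <- data.frame(
  x1 = driver + rnorm(n, sd = 0.3),
  x2 = -driver + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)

pca <- prcomp(scale(toy))

# Loadings: one column per component, one row per original variable
print(round(pca$rotation, 2))

# Orthogonality of the components: the cross-product of the loadings
# is (numerically) the identity matrix
print(round(crossprod(pca$rotation), 10))
```

Variables whose loadings on a component have opposite signs point in opposite directions along that axis in the biplot, and a variable that loads mainly on PC2 points along the vertical axis.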
Next we’ll load the tea dataset from the FactoMineR package and explore the data briefly.
Let’s look at the structure and the dimensions of the data first. Then we’ll create a subset of it by selecting the following variables.
## [1] "Sport" "effect.on.health" "sophisticated"
## [4] "spirituality" "friends" "sex"
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] "breakfast" "tea.time" "evening"
## [4] "lunch" "dinner" "always"
## [7] "home" "work" "tearoom"
## [10] "friends" "resto" "pub"
## [13] "Tea" "How" "sugar"
## [16] "how" "where" "price"
## [19] "age" "sex" "SPC"
## [22] "Sport" "age_Q" "frequency"
## [25] "escape.exoticism" "spirituality" "healthy"
## [28] "diuretic" "friendliness" "iron.absorption"
## [31] "feminine" "sophisticated" "slimming"
## [34] "exciting" "relaxing" "effect.on.health"
## 'data.frame': 300 obs. of 6 variables:
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## Sport effect.on.health sophisticated
## Not.sportsman:121 effect on health : 66 Not.sophisticated: 85
## sportsman :179 No.effect on health:234 sophisticated :215
## spirituality friends sex
## Not.spirituality:206 friends :196 F:178
## spirituality : 94 Not.friends:104 M:122
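Selecting a subset of columns by name can be sketched as follows. The data frame `tea_toy` and its values are invented for illustration, and `keep_columns` and `tea_sub` are hypothetical names; only the column names themselves come from the tea data.

```r
# Hypothetical miniature of the tea data (values invented for illustration)
tea_toy <- data.frame(
  Sport   = factor(c("sportsman", "Not.sportsman", "sportsman")),
  friends = factor(c("friends", "Not.friends", "friends")),
  sex     = factor(c("M", "F", "F")),
  age     = c(39L, 45L, 47L)
)

# Names of the columns to keep
keep_columns <- c("Sport", "friends", "sex")

# Index by a character vector to keep only those columns
tea_sub <- tea_toy[, keep_columns]

# Structure and level counts of the subset
str(tea_sub)
print(summary(tea_sub))
```

For factor columns, `summary()` prints the number of observations per level, which is what the table above shows.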
Let’s run a multiple correspondence analysis (MCA) on the selected tea variables.
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.225 0.175 0.169 0.159 0.140 0.132
## % of var. 22.474 17.492 16.890 15.938 14.029 13.176
## Cumulative % of var. 22.474 39.967 56.857 72.795 86.824 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.941 1.315 0.730 | 0.294 0.165 0.071 | -0.407
## 2 | 0.561 0.468 0.290 | -0.123 0.029 0.014 | -0.554
## 3 | 0.389 0.225 0.176 | -0.691 0.909 0.555 | -0.367
## 4 | -0.333 0.165 0.087 | 1.007 1.933 0.791 | -0.108
## 5 | 0.599 0.532 0.239 | 0.722 0.993 0.347 | -0.402
## 6 | 0.941 1.315 0.730 | 0.294 0.165 0.071 | -0.407
## 7 | 0.769 0.878 0.599 | -0.273 0.142 0.076 | -0.220
## 8 | 0.069 0.007 0.007 | 0.119 0.027 0.019 | -0.392
## 9 | 0.449 0.300 0.235 | 0.536 0.548 0.335 | -0.245
## 10 | 0.501 0.373 0.186 | 0.337 0.216 0.084 | -0.274
## ctr cos2
## 1 0.326 0.136 |
## 2 0.605 0.282 |
## 3 0.266 0.157 |
## 4 0.023 0.009 |
## 5 0.318 0.107 |
## 6 0.326 0.136 |
## 7 0.095 0.049 |
## 8 0.304 0.211 |
## 9 0.119 0.070 |
## 10 0.149 0.056 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## Not.sportsman | -0.747 16.695 0.377 -10.621 | 0.064 0.158
## sportsman | 0.505 11.285 0.377 10.621 | -0.043 0.107
## effect on health | 0.342 1.912 0.033 3.143 | -0.007 0.001
## No.effect on health | -0.097 0.539 0.033 -3.143 | 0.002 0.000
## Not.sophisticated | 1.003 21.145 0.398 10.907 | -0.436 5.129
## sophisticated | -0.397 8.359 0.398 -10.907 | 0.172 2.028
## Not.spirituality | 0.305 4.742 0.204 7.811 | -0.336 7.405
## spirituality | -0.669 10.392 0.204 -7.811 | 0.737 16.229
## friends | -0.170 1.396 0.054 -4.029 | -0.494 15.163
## Not.friends | 0.320 2.631 0.054 4.029 | 0.930 28.577
## cos2 v.test Dim.3 ctr cos2 v.test
## Not.sportsman 0.003 0.912 | 0.195 1.506 0.026 2.765 |
## sportsman 0.003 -0.912 | -0.131 1.018 0.026 -2.765 |
## effect on health 0.000 -0.067 | 1.762 67.420 0.876 16.184 |
## No.effect on health 0.000 0.067 | -0.497 19.016 0.876 -16.184 |
## Not.sophisticated 0.075 -4.739 | -0.285 2.271 0.032 -3.099 |
## sophisticated 0.075 4.739 | 0.113 0.898 0.032 3.099 |
## Not.spirituality 0.248 -8.612 | -0.004 0.001 0.000 -0.096 |
## spirituality 0.248 8.612 | 0.008 0.002 0.000 0.096 |
## friends 0.459 -11.716 | 0.160 1.640 0.048 3.787 |
## Not.friends 0.459 11.716 | -0.301 3.092 0.048 -3.787 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Sport | 0.377 0.003 0.026 |
## effect.on.health | 0.033 0.000 0.876 |
## sophisticated | 0.398 0.075 0.032 |
## spirituality | 0.204 0.248 0.000 |
## friends | 0.054 0.459 0.048 |
## sex | 0.282 0.265 0.032 |
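The “% of var.” row in the eigenvalue table is simply each eigenvalue divided by the sum of all eigenvalues. With six two-level variables the total inertia of an MCA is 12/6 - 1 = 1, which is why the six eigenvalues add up to one. The sketch below redoes that arithmetic from the printed values; since those are rounded to three decimals, the result agrees with the printed percentages only to about one decimal.

```r
# Eigenvalues as printed in the summary above (rounded to three decimals)
eig <- c(Dim.1 = 0.225, Dim.2 = 0.175, Dim.3 = 0.169,
         Dim.4 = 0.159, Dim.5 = 0.140, Dim.6 = 0.132)

# Total inertia
print(sum(eig))

# Percentage of variance per dimension and its running total
pct <- 100 * eig / sum(eig)
print(round(pct, 1))
print(round(cumsum(pct), 1))
```

The first two dimensions together retain roughly 40 % of the variance, so a two-dimensional MCA plot leaves a good deal of the structure out.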
This proportional barplot confirms that women drink tea with friends more often than men do (which was also suggested by the MCA plot above).
The MCA biplot above also showed that women regard tea drinking as sophisticated more often than men do. This finding is confirmed in the barplot below.
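The proportions behind such a barplot can be computed with `table()` and `prop.table()`. The sketch below uses a small invented data frame (`toy`), not the tea survey, so the numbers it prints say nothing about the real data.

```r
# Hypothetical miniature data (invented values, not the tea survey)
toy <- data.frame(
  sex     = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  friends = factor(c("friends", "friends", "friends", "Not.friends",
                     "friends", "Not.friends", "Not.friends", "Not.friends"))
)

# Cross-tabulate, then convert counts to within-sex proportions
counts <- table(toy$sex, toy$friends)
props  <- prop.table(counts, margin = 1)   # margin = 1: each row sums to 1
print(round(props, 2))

# barplot(t(props), legend.text = TRUE) would draw one stacked bar per sex
```

Normalizing within each sex is what makes the bars comparable even when the two groups differ in size, as they do in the tea data (178 women, 122 men).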
It’s a good idea to introduce the variables of the human dataset before moving further.
Here are the links to the metadata of the Human Development Index dataset.